Project 1. Predicting Walmart Sales¶
What is it?¶
Predicting the weekly sales of Walmart stores based on data from 45 different locations from 2010 to 2012.
Why was it made?¶
What's interesting about this project is that none of the factors considered are directly related to store sales or performance; instead, they describe the circumstances of the region in which each store is located.
To that point, the factors considered include:¶
- The average air temperature in the region [F]
- The cost of fuel in the region [USD/gallon]
- That week's consumer price index [USD]
- The unemployment rate in the region [%]
- Whether a holiday occurred in that week
Along with engineered factors:¶
- the month the week happened in
- the week's number within the month
- the week's number within the year
- the differences between the current week's circumstances and the previous week's
Information about which store each record came from is also present in the dataset, which may help the model predict sales for that particular store. However, if the goal is a model able to accurately predict sales based only on regional data, this information should be ignored, to improve the model's ability to generalize.
Project structure¶
While the main goal was to predict the weekly sales of any store, a store classification model was also trained. As it stands, the sections of the project are:
- Exploratory Data Analysis - considering the distribution of the factors, the correlations between them and judging their usability
- Engineering new date and change-related features, exploring trends, correlations
- Considering different dataset, model and preprocessing choices, then testing them with helper functions created to speed up experimentation
- Presenting results, judging performance, considering contributors, alternatives
- The classification problem
To get straight to the point, the results are presented first, with the development techniques and explorations afterwards
Usage:¶
- As a GitHub Page - enjoy browsing through the results
- As an interactive notebook - with all of the code cells visible, but the dynamic plotly charts needing to be re-rendered [note: the cells under "Code Setup" must be run for interactive usage of later parts]
Results¶
What's in the data?¶
Before considering training, it is important to explore the relations and distributions present within the data. For that purpose, we can look at the data manually, at its distribution shapes and at the correlations between the factors.
As such, a sample of the dataset:
|  | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| 1938 | 14 | 12-08-2011 | 1928773.82 | 0 | 77.00 | 3.812 | 186.451211 | 8.625 |
| 2600 | 19 | 06-08-2010 | 1492060.89 | 0 | 74.20 | 2.942 | 132.614193 | 8.099 |
| 619 | 5 | 31-12-2010 | 298180.18 | 1 | 49.79 | 2.943 | 211.956714 | 6.768 |
| 3459 | 25 | 13-08-2010 | 686072.39 | 0 | 72.73 | 2.805 | 204.853378 | 7.527 |
| 4631 | 33 | 25-02-2011 | 242901.21 | 0 | 54.89 | 3.398 | 128.130000 | 8.951 |
The data statistics:
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Store | 6435.0 | 2.300000e+01 | 12.988182 | 1.000 | 12.000 | 23.000000 | 3.400000e+01 | 4.500000e+01 |
| Weekly_Sales | 6435.0 | 1.046965e+06 | 564366.622054 | 209986.250 | 553350.105 | 960746.040000 | 1.420159e+06 | 3.818686e+06 |
| Holiday_Flag | 6435.0 | 6.993007e-02 | 0.255049 | 0.000 | 0.000 | 0.000000 | 0.000000e+00 | 1.000000e+00 |
| Temperature | 6435.0 | 6.066378e+01 | 18.444933 | -2.060 | 47.460 | 62.670000 | 7.494000e+01 | 1.001400e+02 |
| Fuel_Price | 6435.0 | 3.358607e+00 | 0.459020 | 2.472 | 2.933 | 3.445000 | 3.735000e+00 | 4.468000e+00 |
| CPI | 6435.0 | 1.715784e+02 | 39.356712 | 126.064 | 131.735 | 182.616521 | 2.127433e+02 | 2.272328e+02 |
| Unemployment | 6435.0 | 7.999151e+00 | 1.875885 | 3.879 | 6.891 | 7.874000 | 8.622000e+00 | 1.431300e+01 |
And the data distributions plots:
Note: to see this interactive figure (and any like it), uncomment the cells above and run the notebook locally, as there seems to be an issue with the combination of GitHub, plotly and notebooks
What conclusions can we reach from these distributions?
- The stores are represented evenly
- Unsurprisingly, the count of weeks with holidays is much smaller than the count of those without
- The range of weekly sales is quite vast, which might make the training more challenging, but the resulting model more robust
- The range of temperatures is also quite vast, which, if chosen as a factor, may improve the model's robustness
- The distribution of fuel prices may indicate either a sudden change in fuel prices at some point in time within the dataset or a difference between the stores' regions
- Much clearer evidence of this may be visible in the Consumer Price Index
- The unemployment rate looks like a healthy distribution, with a small number of outliers
For a further exploration of the distributions, particularly for CPI and fuel price, see Exploring the distributions further later in the notebook
Looking at the relationship matrix of available variables:
- It's rather hard to make any substantial claims about any factor's relation to our predicted variable (weekly sales). No clear linear trends appear.
- Still, the distribution plots appear more promising, with a healthy amount of variance and not a lot of extreme values for any of the factors.
- The only possibly suspicious distribution is that of the consumer price index, with a clear distribution split, implying either a change in how it is calculated at some point in time, a sudden change in the US Dollar's value or some other event, which may not bode well for the model's ability to generalize.
- Nevertheless, it will be considered as one of the factors to include in the prediction, as intuition suggests it should strongly influence consumer spending habits.
Feature engineering¶
To understand the exact operations taken in-depth, see Datatype cleanup, extra feature engineering further in the notebook.
As a repeat from the introduction, the features added were:
- the month the week happened in
- the week's number within the month
- the week's number within the year
- the differences between the current week's circumstances and the previous week's
With the rationale being:
- The number of the week within a month - may be related to typical salary payout schedules in US regions. If many customers received their salary in a particular week, they might spend more that week or the week after. Conversely, if they haven't received a salary in a while, they might purchase less in general.
- The month - average customer spending could be related to the presence of big holidays within a month, the spending over the year may show cycling tendencies etc.
- The week of the year could show similar tendencies
- The value of the previous week's metrics, and the difference between the current and the previous week's values, might reveal tendencies such as whether a change in fuel price, the consumer price index or the weather affects customer behaviour
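The rationale above relies on a handful of simple date transformations; a minimal standalone sketch follows (using a plain day-based formula for `week_of_month` instead of the pendulum definition applied later in the notebook):

```python
import pandas as pd

# toy frame with parsed dates (the real dataset uses the dd-mm-YYYY format)
df = pd.DataFrame({'date': pd.to_datetime(['2011-08-12', '2010-12-31', '2012-03-02'])})

df['month'] = df['date'].dt.month                      # month of the year
df['week_number'] = df['date'].dt.isocalendar().week   # ISO week of the year
# simple approximation: which 7-day slice of the month the date falls into
df['week_of_month'] = (df['date'].dt.day - 1) // 7 + 1

print(df)
```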
Regression modelling / predicting weekly sales¶
To read up more about testing various model and dataset combinations, see Weekly sales regression modelling. In total, almost 100 models (counting all hyperparameter and dataset combinations) were tested, out of which one was chosen as the best performing.
In the end, the model chosen was an XGBoost model with 2000 estimators and a learning rate of 0.01.
The dataset used for predicting included both the date-related factors as well as the previous week's values and the differences between the current and previous week's values, obviously excluding the difference in weekly sales, but including the previous week's sales.
|  | model | mae | median | r2 | score | score_by_max | score_by_min | score_by_mean | score_by_median | score_by_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 19 | xgb_week_prev_diff | 45318.707146 | 28028.2550 | 0.984884 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 47 | xgb_week_prev_diff | 44516.634006 | 27577.3225 | 0.984334 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
Measurement metrics¶
To judge the model, cross-validation across the whole dataset was conducted, to prevent performance from hinging on a particular choice of testing datapoints.
With that, the score column represents the mean absolute error averaged over the cross-validation folds, in dollars.
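This scoring scheme can be sketched as follows (a minimal example on synthetic data, not the project's actual pipeline):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# sklearn maximizes scores, so MAE is exposed as its negative;
# flipping the sign and averaging over the folds gives a single "score"
scores = -1 * cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                              cv=5, scoring='neg_mean_absolute_error')
print(scores.mean())
```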
To put these results in context, let's look at the results of a baseline decision tree model trained on a dataset with no additional features, as well as the maximum, minimum, average, median and standard deviation values of the weekly sales recorded:
|  | max | min | avg | median | std | model_type | mae_error | mae_by_max | mae_by_min | mae_by_avg | mae_by_median | mae_by_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3818686.45 | 209986.25 | 1.046965e+06 | 960746.04 | 564366.62 | tree_base | 407330.46 | 10.67 | 193.98 | 38.91 | 42.4 | 72.17 |
The baseline model achieved an error of around 407k dollars, which constitutes about 40% of the average sales value across all the stores. Not usable by any means.
In contrast, looking at the best performing model's results:
A holdout mean absolute error of about 45 thousand dollars could prove very useful. When looking at the averaged cross-validation score, however, it rises to about 63 thousand, which is still only about 6% of the mean of weekly sales and a measly 1.6% of the maximum recorded sales - especially impressive considering the premise of this project, that none of the factors directly relate to store sales, with the closest in theory being the Consumer Price Index.
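As a quick check of the arithmetic, the score_by_mean and score_by_max columns are just the cross-validated MAE expressed as a percentage of the corresponding sales statistic (values copied from the tables above):

```python
score = 62394.92        # averaged cross-validated MAE, USD
mean_sales = 1046965.0  # mean weekly sales, USD
max_sales = 3818686.45  # maximum weekly sales, USD

score_by_mean = round(score * 100 / mean_sales, 2)  # -> 5.96
score_by_max = round(score * 100 / max_sales, 2)    # -> 1.63
print(score_by_mean, score_by_max)
```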
Further explorations of the model's performance in relation to input features are presented in Exploring the predictions / results further
Development¶
Dataset exploration¶
A quick look¶
df.sample(5)
|  | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| 2365 | 17 | 29-07-2011 | 861894.77 | 0 | 68.59 | 3.574 | 129.167710 | 6.745 |
| 3254 | 23 | 02-03-2012 | 1322852.20 | 0 | 25.91 | 3.827 | 137.423897 | 4.261 |
| 2650 | 19 | 22-07-2011 | 1377119.45 | 0 | 79.37 | 3.972 | 135.873839 | 7.806 |
| 2547 | 18 | 27-04-2012 | 961186.23 | 0 | 50.43 | 4.023 | 137.978133 | 8.304 |
| 6302 | 45 | 16-04-2010 | 782221.96 | 0 | 54.28 | 2.899 | 181.692477 | 8.899 |
Are there any missing values that need to be handled?¶
df.isnull().sum()
Store 0 Date 0 Weekly_Sales 0 Holiday_Flag 0 Temperature 0 Fuel_Price 0 CPI 0 Unemployment 0 dtype: int64
Make column names uniform¶
df.columns = list(map(lambda col: col.lower().replace(' ', '_'), df.columns))
Explore the distribution of the factors within the dataset¶
hist_fig = make_subplots(rows = 6, cols = 2,
subplot_titles=("Store number", "Is holiday?",
"Weekly sales",
"Temperature",
"Fuel price",
"CPI",
"Unemployment"),
specs = [
[{}, {}],
[{"colspan": 2}, None],
[{"colspan": 2}, None],
[{"colspan": 2}, None],
[{"colspan": 2}, None],
[{"colspan": 2}, None],
],
x_title = 'value',
y_title = 'count'
)
hist_fig.add_trace(
go.Histogram(x=df['store']),
row=1, col=1
)
hist_fig.add_trace(
go.Histogram(x=df['holiday_flag']),
row=1, col=2
)
hist_fig.add_trace(
go.Histogram(x=df['weekly_sales']),
row=2, col=1
)
hist_fig.add_trace(
go.Histogram(x=df['temperature']),
row=3, col=1
)
hist_fig.add_trace(
go.Histogram(x=df['fuel_price']),
row=4, col=1
)
hist_fig.add_trace(
go.Histogram(x=df['cpi']),
row=5, col=1
)
hist_fig.add_trace(
go.Histogram(x=df['unemployment']),
row=6, col=1
)
hist_fig.update_layout(
title = "Factor histograms",
# xaxis_title = "count",
# yaxis_title = "value",
height = 1000, width = 1200
)
# for i in range(1,5):
# fig['layout']['xaxis{}'.format(i)]['title']='Label X axis 1'
# fig['layout']['yaxis{}'.format(i)]['title']='Label X axis 2'
# hist_fig.update_layout(autosize=False)
What conclusions can we reach from these distributions?
- The stores are represented evenly
- Unsurprisingly, the count of weeks with holidays is much smaller than the count of those without
- The range of weekly sales is quite vast, which might make the training more challenging, but the resulting model more robust
- The range of temperatures is also quite vast, which, if chosen as a factor, may improve the model's robustness
- The distribution of fuel prices may indicate either a sudden change in fuel prices at some point in time within the dataset or a difference between the stores' regions
- Much clearer evidence of this may be visible in the Consumer Price Index
- The unemployment rate looks like a healthy distribution, with a small number of outliers
df.describe().transpose()
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| store | 6435.0 | 2.300000e+01 | 12.988182 | 1.000 | 12.000 | 23.000000 | 3.400000e+01 | 4.500000e+01 |
| weekly_sales | 6435.0 | 1.046965e+06 | 564366.622054 | 209986.250 | 553350.105 | 960746.040000 | 1.420159e+06 | 3.818686e+06 |
| holiday_flag | 6435.0 | 6.993007e-02 | 0.255049 | 0.000 | 0.000 | 0.000000 | 0.000000e+00 | 1.000000e+00 |
| temperature | 6435.0 | 6.066378e+01 | 18.444933 | -2.060 | 47.460 | 62.670000 | 7.494000e+01 | 1.001400e+02 |
| fuel_price | 6435.0 | 3.358607e+00 | 0.459020 | 2.472 | 2.933 | 3.445000 | 3.735000e+00 | 4.468000e+00 |
| cpi | 6435.0 | 1.715784e+02 | 39.356712 | 126.064 | 131.735 | 182.616521 | 2.127433e+02 | 2.272328e+02 |
| unemployment | 6435.0 | 7.999151e+00 | 1.875885 | 3.879 | 6.891 | 7.874000 | 8.622000e+00 | 1.431300e+01 |
Exploring the distributions further¶
Looking at the source of the split in the fuel price and CPI distributions
# as_index expects a boolean - the string 'False' is truthy, so pass the actual False
df_by_date = df.groupby(by=['date'], as_index=False).agg(fuel_price = ('fuel_price', 'mean'),
                                                         cpi = ('cpi', 'mean'), unemployment = ('unemployment', 'mean'),
                                                         weekly_sales = ('weekly_sales', 'mean'))
df_by_date.columns
Index(['date', 'fuel_price', 'cpi', 'unemployment', 'weekly_sales'], dtype='object')
price_date_fig = make_subplots(specs=[[{"secondary_y": True}]])
# go.Line is deprecated - use go.Scatter with mode='lines' instead
price_trace = go.Scatter(
    x=df_by_date['date'],
    y=df_by_date['fuel_price'],
    mode = 'lines',
    name = 'fuel price',
    line = dict(color='rgb(34,163,192)')
)
cpi_trace = go.Scatter(
    x=df_by_date['date'],
    y=df_by_date['cpi'],
    mode = 'lines',
    name = 'CPI',
    yaxis = 'y2'
)
price_date_fig.add_trace(price_trace)
price_date_fig.add_trace(cpi_trace, secondary_y=True)
price_date_fig.show()
There appears to be a fairly smooth, cyclic behaviour of both fuel price and CPI (averaged between stores) over time, with no sudden jumps that might explain the distribution split. Before making any conclusions, a look at the relation between these factors and the store should be taken.
df_by_store = df.groupby(by=['store'], as_index=False).agg(fuel_price = ('fuel_price', 'mean'),
                                                           cpi = ('cpi', 'mean'), unemployment = ('unemployment', 'mean'),
                                                           weekly_sales = ('weekly_sales', 'mean'))
# px.line(df_by_store, x='store', y='fuel_price', title="Fuel price by store")
price_store_fig = make_subplots(specs=[[{"secondary_y": True}]])
price_trace = go.Scatter(
    x=df_by_store['store'],
    y=df_by_store['fuel_price'],
    mode = 'lines',
    name = 'fuel price',
    line = dict(color='rgb(34,163,192)')
)
cpi_trace = go.Scatter(
    x=df_by_store['store'],
    y=df_by_store['cpi'],
    mode = 'lines',
    name = 'CPI',
    yaxis = 'y2'
)
price_store_fig.add_trace(price_trace)
price_store_fig.add_trace(cpi_trace, secondary_y=True)
price_store_fig.show()
There seems to be an inverse relation between fuel price and the consumer price index for many of the stores.
df_by_store[df_by_store['cpi'] < 150]['store'].unique().shape
(22,)
df_by_store[df_by_store['fuel_price'] < 3.4]['store'].unique().shape
(24,)
Factor relationships¶
scatter_fig = px.scatter_matrix(df)
scatter_fig.update_layout(height = 1200, width = 1200)
scatter_fig.show()
- It's rather hard to make any substantial claims about any factor's relation to our predicted variable (weekly sales). No clear linear trends appear.
- Still, the distribution plots appear more promising, with a healthy amount of variance and not a lot of extreme values for any of the factors.
- The only possibly suspicious distribution is that of the consumer price index, with a clear distribution split, implying either a change in how it is calculated at some point in time, a sudden change in the US Dollar's value or some other event, which may not bode well for the model's ability to generalize.
- Nevertheless, it will be considered as one of the factors to include in the prediction, as intuition suggests it should strongly influence consumer spending habits.
Datatype cleanup, extra feature engineering¶
Extra features - week number, month, week of month¶
df.dtypes
df['date'] = pd.to_datetime(df['date'], format="%d-%m-%Y")
df['week_number'] = df['date'].dt.isocalendar().week
df['month'] = df['date'].dt.month
df['week_of_month'] = df.apply(
lambda row: pendulum.parse(row['date'].strftime('%Y-%m-%d')).week_of_month,
axis=1
)
Extra features - previous week's values and value changes¶
Making sure not to mix values between stores
# shift each factor by one week within each store, so values never cross store boundaries
shift_cols = ['weekly_sales', 'temperature', 'fuel_price', 'cpi', 'unemployment',
              'holiday_flag', 'month', 'week_number', 'week_of_month']
for col in shift_cols:
    df[f'prev_{col}'] = df.sort_values(['store', 'date']).groupby(['store'])[col].shift()
# fillna first values of type - each store's first week has no predecessor, so reuse its own values
for col in shift_cols:
    df[f'prev_{col}'] = df[f'prev_{col}'].fillna(df[col])
Calculate differences from previous week to current
prev_cols = ['weekly_sales', 'temperature', 'fuel_price', 'cpi', 'unemployment', 'holiday_flag']
for col in prev_cols:
    df[f'{col}_diff'] = df[col] - df[f'prev_{col}']
# scatter_2_cols = [f'prev_{col}' for col in prev_cols]
# scatter_fig_2 = px.scatter_matrix(df,
# dimensions = scatter_2_cols)
# scatter_fig_2.update_layout(height = 1400, width = 1400)
# scatter_fig_2.show()
Date-related factors and weekly sales¶
scatter_fig_week = px.scatter_matrix(df,
dimensions = ['week_number', 'month', 'week_of_month', 'weekly_sales'])
scatter_fig_week.update_layout(height = 1000, width = 1000)
scatter_fig_week.show()
It would seem that week 4 is typically the big-spender week in many of the stores. A similar rise can be observed in months 11 and 12 (November and December), which include both Thanksgiving Day and the Christmas holidays, which can drive up sales. Also worth considering is the January fall in sales that follows.
Weekly sales regression modelling¶
With features prepared and their relationships explored, it may still be unclear how much they might contribute to the prediction accuracy - be it because of their vast variance, the clear or unclear trends present in the data, or uncaptured, indirect relations between the factors considered.
On top of that, there exists a vast range of techniques and model types that could perform differently on the supplied data, capturing different relations within it.
For those reasons, to make experimentation easier and faster, let's create a few helper functions that allow for quick dataset, hyperparameter and model type swaps.
Baseline¶
Before diving further into exploring, let's establish the metrics against which we can judge model performance. For weekly sales, such a measure could be how the mean or median prediction error compares to the minimum, maximum, mean, median and standard deviation of the weekly sales present in the dataset
max_sales = df_eng['weekly_sales'].max()
min_sales = df_eng['weekly_sales'].min()
avg_sales = df_eng['weekly_sales'].mean()
median_sales = df_eng['weekly_sales'].median()
std_sales = round(df_eng['weekly_sales'].std(),2)
sales_data = {'max': [max_sales], 'min': [min_sales],
'avg': [avg_sales], 'median': [median_sales],
'std': [std_sales]}
sales_df = pd.DataFrame(sales_data)
sales_df
|  | max | min | avg | median | std |
|---|---|---|---|---|---|
| 0 | 3818686.45 | 209986.25 | 1.046965e+06 | 960746.04 | 564366.62 |
Baseline decision tree model¶
For a baseline performance, let's use a decision tree model with no hyperparameter tuning, only the base dataset features, no preprocessing and no cross-validation.
X_base = df_eng[['holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month']]
y_base = df_eng[['weekly_sales']]
X_base_train, X_base_test, y_base_train, y_base_test = train_test_split(
X_base, y_base, shuffle=True, train_size=0.8)
tree_model_base = DecisionTreeRegressor()
tree_model_base.fit(X_base_train, y_base_train)
tree_base_preds = tree_model_base.predict(X_base_test)
tree_base_error = mean_absolute_error(y_base_test, tree_base_preds)
tree_base_error
407330.46155400155
sales_df['model_type'] = 'tree_base'
sales_df['mae_error'] = round(tree_base_error,2)
sales_df['mae_by_max'] = round(sales_df['mae_error'] * 100 / sales_df['max'],2)
sales_df['mae_by_min'] = round(sales_df['mae_error'] * 100 / sales_df['min'],2)
sales_df['mae_by_avg'] = round(sales_df['mae_error'] *100 / sales_df['avg'],2)
sales_df['mae_by_median'] = round(sales_df['mae_error'] * 100 / sales_df['median'],2)
sales_df['mae_by_std'] = round(sales_df['mae_error'] * 100 / sales_df['std'],2)
sales_df
|  | max | min | avg | median | std | model_type | mae_error | mae_by_max | mae_by_min | mae_by_avg | mae_by_median | mae_by_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3818686.45 | 209986.25 | 1.046965e+06 | 960746.04 | 564366.62 | tree_base | 407330.46 | 10.67 | 193.98 | 38.91 | 42.4 | 72.17 |
sales_df.to_csv('../walmart_sales_baseline.csv')
With a mean prediction error of ~407k USD, the scale of error is quite large: it amounts to nearly 200% of the minimum sales value and about 40% of mean sales, making the predictions far too unreliable to use.
Uniform training function¶
The function will:
- take in model, dataset, parameters, training options
- scale / preprocess the data if specified
- train model
- cross-validate the model
- return results of mae, r2 score, median_error as a new row of dataframe collecting results from all types of models, datasets etc
- calculate the proportion of error to the max, min, average, median and std of weekly sales
def preprocess_and_train_regression(model, dataset, target_col,
                                    standard_scale=True,
                                    model_name = None):
    print(f'model {model_name} started training')
    X = dataset.drop([target_col], axis=1)
    y = dataset[[target_col]]
    X_train, X_valid, y_train, y_valid = train_test_split(X, y.values.ravel(), shuffle=True, train_size=0.8)
    numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64',
                                                                                     'UInt32', 'int32']]
    # Preprocessing transformers
    numerical_transformer = StandardScaler()
    # combined preprocessor
    preprocessor = ColumnTransformer(
        transformers = [
            ('num', numerical_transformer, numerical_cols),
            # ('cat', categorical_transformer, categorical_cols)
        ]
    )
    # Bundle preprocessing and modeling code in a pipeline
    if standard_scale == True:
        train_pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('model', model)
        ])
    else:
        train_pipeline = Pipeline(steps=[
            ('model', model)
        ])
    # Preprocessing of training data, fit model
    train_pipeline.fit(X_train, y_train)
    # Preprocessing of validation data, get predictions
    preds = train_pipeline.predict(X_valid)
    mae = mean_absolute_error(y_valid, preds)
    median = median_absolute_error(y_valid, preds)
    r2 = r2_score(y_valid, preds)
    # cross-validate and average the score
    scores = -1 * cross_val_score(train_pipeline,
                                  X, y.values.ravel(),
                                  cv=5,
                                  scoring='neg_mean_absolute_error')
    score = scores.mean()
    print(f'model {model_name} trained, score: {score}')
    return mae, median, score, r2, train_pipeline
def model_results_to_dataframe(mae, median, r2, score, model_name, results_df):
    new_row = {
        'model': model_name,
        'mae': mae,
        'median': median,
        'r2': r2,
        'score': score,
        'score_by_max': round(score * 100 / max_sales,2),
        'score_by_min': round(score * 100 / min_sales,2),
        'score_by_mean': round(score * 100 / avg_sales,2),
        'score_by_median': round(score * 100 / median_sales,2),
        'score_by_std': round(score * 100 / std_sales,2)
    }
    new_df = pd.DataFrame([new_row])
    results_df = pd.concat([results_df, new_df], ignore_index=True)
    return results_df
Define possible dataset combinations¶
# the base dataframe information, without the date
df_noeng = df[['store', 'weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment']]
# the base dataframe information, without the date and store
df_noeng_nostore = df[['weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment']]
# with additional information about week and month
df_week = df[['weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month']]
# with additional information about week and month AND data about the previous week
df_week_prev = df[['weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month', 'prev_weekly_sales', 'prev_temperature',
'prev_fuel_price', 'prev_cpi', 'prev_unemployment', 'prev_month',
'prev_week_number', 'prev_week_of_month']]
# with additional information about week and month AND data about the previous week AND the differences
df_week_prev_diff = df[['weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month', 'prev_weekly_sales', 'prev_temperature',
'prev_fuel_price', 'prev_cpi', 'prev_unemployment',
'temperature_diff', 'fuel_price_diff', 'cpi_diff', 'unemployment_diff',
'prev_holiday_flag', 'holiday_flag_diff', 'prev_month',
'prev_week_number', 'prev_week_of_month']]
# with differences, without holiday and temperature - least directly related factors
df_week_prev_diff_no_holiday = df[['weekly_sales',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month', 'prev_weekly_sales',
'prev_fuel_price', 'prev_cpi', 'prev_unemployment',
'fuel_price_diff', 'cpi_diff', 'unemployment_diff',
'prev_month',
'prev_week_number', 'prev_week_of_month']]
df_week_prev_diff_store = df[['store', 'weekly_sales', 'holiday_flag', 'temperature',
'fuel_price', 'cpi', 'unemployment', 'week_number', 'month',
'week_of_month', 'prev_weekly_sales', 'prev_temperature',
'prev_fuel_price', 'prev_cpi', 'prev_unemployment',
'temperature_diff', 'fuel_price_diff', 'cpi_diff', 'unemployment_diff',
'prev_holiday_flag', 'holiday_flag_diff', 'prev_month',
'prev_week_number', 'prev_week_of_month']]
datasets = [df_noeng, df_noeng_nostore,
df_week, df_week_prev,
df_week_prev_diff, df_week_prev_diff_no_holiday,
df_week_prev_diff_store]
dataset_names = ['noeng', 'noeng_nostore',
'week', 'week_prev',
'week_prev_diff', 'week_prev_diff_no_holiday',
'week_prev_diff_store']
Decision Tree models¶
Let's try tuning the tree models by setting the maximum number of leaf nodes rather than the maximum depth: setting the maximum depth simply stops growth at that depth, whereas limiting the number of leaf nodes lets the tree optimize for the best splits, potentially dropping some branches and reaching a greater overall depth.
For baseline tree model performance, only the base dataset will be used.
def get_tree_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
best_base_tree_mae = tree_base_error
for max_leaf_nodes in [5, 50, 200, 250, 350, 500, 1000, 1750, 2500, 5000]:
    leaf_mae = get_tree_mae(max_leaf_nodes, X_base_train, X_base_test, y_base_train, y_base_test)
    if leaf_mae < best_base_tree_mae:
        best_base_tree_mae = leaf_mae
    print(f'Max leaf nodes: {max_leaf_nodes}, MAE Error: {leaf_mae}')
Max leaf nodes: 5, MAE Error: 457452.490810993 Max leaf nodes: 50, MAE Error: 388297.4889768926 Max leaf nodes: 200, MAE Error: 334513.0288046381 Max leaf nodes: 250, MAE Error: 331655.10642164515 Max leaf nodes: 350, MAE Error: 337170.3925875377 Max leaf nodes: 500, MAE Error: 352344.10878721357 Max leaf nodes: 1000, MAE Error: 386410.2587788944 Max leaf nodes: 1750, MAE Error: 408961.7812092063 Max leaf nodes: 2500, MAE Error: 406427.8777142688 Max leaf nodes: 5000, MAE Error: 402622.1923983424
Max leaf nodes set as 200¶
Quickly assess the datasets¶
def assess_sales_dataset_tree(input_df):
    input_X = input_df.drop(['weekly_sales'], axis=1)
    input_y = input_df[['weekly_sales']]
    X_train, X_test, y_train, y_test = train_test_split(input_X, input_y, shuffle=True, train_size=0.8)
    model = DecisionTreeRegressor(max_leaf_nodes = 200)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    return mae
best_tree_dataset_mae = df[['weekly_sales']].max().values[0]
for df_index, dataset in enumerate(datasets):
    df_mae = assess_sales_dataset_tree(dataset)
    if df_mae < best_tree_dataset_mae:
        best_tree_dataset_mae = df_mae
    print(f'df: {dataset_names[df_index]}, MAE Error: {df_mae}')
df: noeng, MAE Error: 85575.07687713197 df: noeng_nostore, MAE Error: 342137.1568076557 df: week, MAE Error: 362853.3747665065 df: week_prev, MAE Error: 58572.653681035285 df: week_prev_diff, MAE Error: 66567.47175256797 df: week_prev_diff_no_holiday, MAE Error: 68478.39775016108 df: week_prev_diff_store, MAE Error: 62513.83436315416
Clearly, using the previous week information yields the best results¶
- What's interesting is how badly the removal of the 'store' information hurt performance. Perhaps it is unsurprising - knowing which store is being considered strongly influences the sales prediction, as can be seen in the weekly sales distribution, which spans quite a large range.
- The question is - do we want to use this information for the most accurate results, or do we want to ignore it to make the model more general and applicable to any store not present in the dataset, limiting ourselves to only the data about the region such a potential store would reside in?
Be warned - data leakage danger¶
- To calculate the difference between the previous week's sales and the current week's, you need the current week's sales - which is why it was important to remove this column from the datasets
- It might also be interesting to compare predicting without the definitive weekly differences (i.e. staying future-oriented) against predicting with these values
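A sketch of the leakage guard described above (a toy frame whose column names mirror the ones used in this notebook): the current week's sales difference is derived from the target itself, so it has to be dropped from the feature set:

```python
import pandas as pd

# toy frame standing in for the engineered dataset
df = pd.DataFrame({
    'weekly_sales': [100.0, 110.0],
    'prev_weekly_sales': [90.0, 100.0],
    'weekly_sales_diff': [10.0, 10.0],  # target minus previous week - leaks the target!
    'fuel_price_diff': [0.1, -0.2],
    'cpi_diff': [0.5, 0.4],
})

target = 'weekly_sales'
leaky = {f'{target}_diff'}  # any column computed from the current target
feature_cols = [c for c in df.columns if c != target and c not in leaky]

X = df[feature_cols]
y = df[target]
print(feature_cols)
```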
XGBoost¶
Taking our best performing dataset and pairing it with a more mathematically complex model, let's test hyperparameter combinations
X = df_week_prev.drop(['weekly_sales'], axis=1)
y = df_week_prev[['weekly_sales']]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), shuffle=True, train_size=0.8)
def get_xgb_mae(n_estimators, learning_rate, train_X, val_X, train_y, val_y, early_stopping_rounds=5):
    # Note: in XGBoost >= 2.0, early_stopping_rounds must be passed to the
    # XGBRegressor constructor instead of fit()
    model = XGBRegressor(n_estimators=n_estimators, learning_rate=learning_rate)
    model.fit(train_X, train_y,
              early_stopping_rounds=early_stopping_rounds,
              eval_set=[(val_X, val_y)],
              verbose=False)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
Commented for sake of rerunning - intermediary step¶
# best_xgb_mae = y_train.max()
# for n_estimators in [100, 250, 500, 1000, 2000, 5000]:
#     for learning_rate in [0.001, 0.01, 0.02, 0.05, 0.1]:
#         xgb_mae = get_xgb_mae(n_estimators, learning_rate, X_train, X_test, y_train, y_test)
#         if xgb_mae < best_xgb_mae:
#             best_xgb_mae = xgb_mae
#         print(f'n_estimators: {n_estimators}, learning_rate: {learning_rate}, MAE Error: {xgb_mae}')
Best XGB model: n_estimators = 2000, learning_rate = 0.01¶
SVR, SGDRegressor and good ol' LinearRegression¶
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train, y_train)
svr_preds = svr_model.predict(X_test)
svr_mae = mean_absolute_error(y_test, svr_preds)
svr_mae
472662.50840315357
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_preds = linear_model.predict(X_test)
linear_mae = mean_absolute_error(y_test, linear_preds)
linear_mae
88194.75565251504
from sklearn.linear_model import SGDRegressor
sgd_model = SGDRegressor()
sgd_model.fit(X_train, y_train)
sgd_preds = sgd_model.predict(X_test)
sgd_mae = mean_absolute_error(y_test, sgd_preds)
sgd_mae
1.3392158634442775e+20
Interestingly, the simplest LinearRegression model performed quite well, while the stochastic gradient regressor got completely lost. SGDRegressor is very sensitive to feature scale, so without standardization its gradient steps can diverge, which explains the astronomical error.
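The likely culprit is feature scale: SGD's gradient updates blow up when one raw factor (like CPI, in the hundreds) dwarfs the rest. As an illustrative sketch on synthetic data (not the notebook's dataset), standard-scaling inside a pipeline keeps SGDRegressor well-behaved:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with one feature blown up to a CPI-like scale
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X[:, 0] *= 200.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling first keeps SGD's gradient steps well-behaved
sgd_scaled = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
sgd_scaled.fit(X_tr, y_tr)
mae_scaled = mean_absolute_error(y_te, sgd_scaled.predict(X_te))
```

With scaling in place the pipeline lands near the noise floor of the synthetic data instead of diverging.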
Using the uniform training function¶
prepare the models table¶
tree_model = DecisionTreeRegressor(max_leaf_nodes=200)
random_forest_model = RandomForestRegressor()
linear_model = LinearRegression()
xgb_model = XGBRegressor(n_estimators=2000, learning_rate=0.01)
models = [tree_model, random_forest_model, linear_model, xgb_model]
model_names = ['tree', 'random_forest', 'linear', 'xgb']
results_df = pd.DataFrame()
max_sales = df_eng['weekly_sales'].max()
min_sales = df_eng['weekly_sales'].min()
avg_sales = df_eng['weekly_sales'].mean()
median_sales = df_eng['weekly_sales'].median()
std_sales = round(df_eng['weekly_sales'].std(),2)
Warning - lengthy operation - commented for the sake of rerunning¶
score_best = df_eng['weekly_sales'].max()
best_model = None
for d_index, dataset in enumerate(datasets):
    for m_index, model in enumerate(models):
        model_name = f'{model_names[m_index]}_{dataset_names[d_index]}'
        # reset the models
        tree_model = DecisionTreeRegressor(max_leaf_nodes=200)
        random_forest_model = RandomForestRegressor()
        linear_model = LinearRegression()
        xgb_model = XGBRegressor(n_estimators=2000, learning_rate=0.01)
        mae, median, score, r2, result_model = preprocess_and_train_regression(model=model,
                                                                               dataset=dataset,
                                                                               target_col='weekly_sales',
                                                                               standard_scale=True,
                                                                               model_name=model_name)
        results_df = model_results_to_dataframe(mae, median, r2, score, model_name, results_df)
        if score < score_best:
            score_best = score  # update the best score, otherwise best_model just tracks the last run
            best_model = result_model
model tree_noeng trained, score: 595296.90074159
model random_forest_noeng trained, score: 548620.7537072105
model linear_noeng trained, score: 469524.0295420152
model xgb_noeng trained, score: 536853.2874361888
model tree_noeng_nostore trained, score: 727987.4202729956
model random_forest_noeng_nostore trained, score: 704653.6401096077
model linear_noeng_nostore trained, score: 499858.3530696604
model xgb_noeng_nostore trained, score: 675775.4092566705
model tree_week trained, score: 733433.8645438107
model random_forest_week trained, score: 690858.4305932797
model linear_week trained, score: 499710.50333800464
model xgb_week trained, score: 665544.4722310423
model tree_week_prev trained, score: 89947.89701373907
model random_forest_week_prev trained, score: 70892.63540660452
model linear_week_prev trained, score: 87029.58728653591
model xgb_week_prev trained, score: 64512.623049436676
model tree_week_prev_diff trained, score: 85196.99488435917
model random_forest_week_prev_diff trained, score: 67673.66957765346
model linear_week_prev_diff trained, score: 88700.55195475338
model xgb_week_prev_diff trained, score: 62394.92187325175
model tree_week_prev_diff_no_holiday trained, score: 87770.46384643672
model random_forest_week_prev_diff_no_holiday trained, score: 71091.42570237764
model linear_week_prev_diff_no_holiday trained, score: 86273.63883414381
model xgb_week_prev_diff_no_holiday trained, score: 64371.13253651904
model tree_week_prev_diff_store trained, score: 88125.48518744169
model random_forest_week_prev_diff_store trained, score: 71808.18970893552
model linear_week_prev_diff_store trained, score: 90781.41377110554
model xgb_week_prev_diff_store trained, score: 67456.06630448716
# dataset_names
# mae, median, score, r2, model = preprocess_and_train_regression(model = xgb_model,
# dataset = df_week_prev,
# target_col = 'weekly_sales',
# standard_scale=True,
# model_name = 'xgb_week_prev'
# )
# results_df = model_results_to_dataframe(mae, median, r2, score, 'xgb_week_prev', results_df)
results_df
| | model | mae | median | r2 | score | score_by_max | score_by_min | score_by_mean | score_by_median | score_by_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tree_noeng | 84528.038441 | 49047.070625 | 0.909809 | 583240.541650 | 15.27 | 277.75 | 55.71 | 60.71 | 103.34 |
| 1 | random_forest_noeng | 75610.881902 | 37999.522400 | 0.930984 | 543197.222052 | 14.22 | 258.68 | 51.88 | 56.54 | 96.25 |
| 2 | linear_noeng | 432978.470718 | 393082.532723 | 0.155198 | 469524.029542 | 12.30 | 223.60 | 44.85 | 48.87 | 83.19 |
| 3 | xgb_noeng | 69251.615637 | 40109.565000 | 0.949283 | 536853.287436 | 14.06 | 255.66 | 51.28 | 55.88 | 95.12 |
| 4 | tree_noeng_nostore | 346101.347530 | 237083.921275 | 0.235136 | 727131.555843 | 19.04 | 346.28 | 69.45 | 75.68 | 128.84 |
| 5 | random_forest_noeng_nostore | 392013.727342 | 277347.167500 | 0.094972 | 705502.075501 | 18.47 | 335.98 | 67.39 | 73.43 | 125.01 |
| 6 | linear_noeng_nostore | 459003.152820 | 453239.231604 | 0.024257 | 499858.353070 | 13.09 | 238.04 | 47.74 | 52.03 | 88.57 |
| 7 | xgb_noeng_nostore | 361764.379940 | 286374.870000 | 0.237352 | 675775.409257 | 17.70 | 321.82 | 64.55 | 70.34 | 119.74 |
| 8 | tree_week | 339520.149355 | 252827.462414 | 0.287367 | 736816.914712 | 19.30 | 350.89 | 70.38 | 76.69 | 130.56 |
| 9 | random_forest_week | 398005.335912 | 265231.155650 | 0.027922 | 695067.244252 | 18.20 | 331.01 | 66.39 | 72.35 | 123.16 |
| 10 | linear_week | 478110.642099 | 461358.790059 | 0.030260 | 499710.503338 | 13.09 | 237.97 | 47.73 | 52.01 | 88.54 |
| 11 | xgb_week | 376152.281481 | 296225.595000 | 0.288413 | 665544.472231 | 17.43 | 316.95 | 63.57 | 69.27 | 117.93 |
| 12 | tree_week_prev | 62582.599060 | 37248.437429 | 0.967622 | 90782.860119 | 2.38 | 43.23 | 8.67 | 9.45 | 16.09 |
| 13 | random_forest_week_prev | 49559.057340 | 31465.890500 | 0.982871 | 71231.453334 | 1.87 | 33.92 | 6.80 | 7.41 | 12.62 |
| 14 | linear_week_prev | 84739.783086 | 47148.797813 | 0.925601 | 87029.587287 | 2.28 | 41.45 | 8.31 | 9.06 | 15.42 |
| 15 | xgb_week_prev | 42243.329593 | 27601.765000 | 0.986700 | 64512.623049 | 1.69 | 30.72 | 6.16 | 6.71 | 11.43 |
| 16 | tree_week_prev_diff | 65054.680999 | 39840.262672 | 0.960453 | 85515.924905 | 2.24 | 40.72 | 8.17 | 8.90 | 15.15 |
| 17 | random_forest_week_prev_diff | 49077.723189 | 29202.620300 | 0.978815 | 68778.867838 | 1.80 | 32.75 | 6.57 | 7.16 | 12.19 |
| 18 | linear_week_prev_diff | 92056.264625 | 44358.238305 | 0.881266 | 88700.551955 | 2.32 | 42.24 | 8.47 | 9.23 | 15.72 |
| 19 | xgb_week_prev_diff | 45318.707146 | 28028.255000 | 0.984884 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 20 | tree_week_prev_diff_no_holiday | 68270.489115 | 39101.768000 | 0.952310 | 90939.404071 | 2.38 | 43.31 | 8.69 | 9.47 | 16.11 |
| 21 | random_forest_week_prev_diff_no_holiday | 52319.604537 | 33482.040600 | 0.978943 | 70734.527329 | 1.85 | 33.69 | 6.76 | 7.36 | 12.53 |
| 22 | linear_week_prev_diff_no_holiday | 83014.877178 | 45714.968158 | 0.929812 | 86273.638834 | 2.26 | 41.09 | 8.24 | 8.98 | 15.29 |
| 23 | xgb_week_prev_diff_no_holiday | 44215.133311 | 27247.982500 | 0.984934 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
| 24 | tree_week_prev_diff_store | 59283.626623 | 37077.653333 | 0.973241 | 87082.603453 | 2.28 | 41.47 | 8.32 | 9.06 | 15.43 |
| 25 | random_forest_week_prev_diff_store | 50978.797249 | 32286.404700 | 0.981560 | 70372.906932 | 1.84 | 33.51 | 6.72 | 7.32 | 12.47 |
| 26 | linear_week_prev_diff_store | 94119.811323 | 48815.818406 | 0.894104 | 90781.413771 | 2.38 | 43.23 | 8.67 | 9.45 | 16.09 |
| 27 | xgb_week_prev_diff_store | 45553.661486 | 27465.400000 | 0.983515 | 67456.066304 | 1.77 | 32.12 | 6.44 | 7.02 | 11.95 |
| 28 | tree_noeng | 85983.615203 | 43302.174000 | 0.910808 | 595296.900742 | 15.59 | 283.49 | 56.86 | 61.96 | 105.48 |
| 29 | random_forest_noeng | 72428.620732 | 36519.208900 | 0.938044 | 548620.753707 | 14.37 | 261.27 | 52.40 | 57.10 | 97.21 |
| 30 | linear_noeng | 421304.677886 | 372452.738581 | 0.162814 | 469524.029542 | 12.30 | 223.60 | 44.85 | 48.87 | 83.19 |
| 31 | xgb_noeng | 67034.716314 | 37885.375000 | 0.949488 | 536853.287436 | 14.06 | 255.66 | 51.28 | 55.88 | 95.12 |
| 32 | tree_noeng_nostore | 334036.967555 | 212780.203333 | 0.267316 | 727987.420273 | 19.06 | 346.68 | 69.53 | 75.77 | 128.99 |
| 33 | random_forest_noeng_nostore | 376563.489741 | 223643.667600 | 0.090543 | 704653.640110 | 18.45 | 335.57 | 67.30 | 73.34 | 124.86 |
| 34 | linear_noeng_nostore | 464807.078464 | 449828.344397 | 0.026641 | 499858.353070 | 13.09 | 238.04 | 47.74 | 52.03 | 88.57 |
| 35 | xgb_noeng_nostore | 378340.086334 | 315233.380000 | 0.255569 | 675775.409257 | 17.70 | 321.82 | 64.55 | 70.34 | 119.74 |
| 36 | tree_week | 345759.717913 | 256729.129583 | 0.292672 | 733433.864544 | 19.21 | 349.28 | 70.05 | 76.34 | 129.96 |
| 37 | random_forest_week | 397932.437374 | 262895.990300 | 0.053982 | 690858.430593 | 18.09 | 329.00 | 65.99 | 71.91 | 122.41 |
| 38 | linear_week | 462281.497432 | 451581.314085 | 0.016857 | 499710.503338 | 13.09 | 237.97 | 47.73 | 52.01 | 88.54 |
| 39 | xgb_week | 362318.597143 | 289940.170000 | 0.302927 | 665544.472231 | 17.43 | 316.95 | 63.57 | 69.27 | 117.93 |
| 40 | tree_week_prev | 62265.487172 | 37797.885096 | 0.970039 | 89947.897014 | 2.36 | 42.84 | 8.59 | 9.36 | 15.94 |
| 41 | random_forest_week_prev | 50042.854434 | 30185.254700 | 0.979143 | 70892.635407 | 1.86 | 33.76 | 6.77 | 7.38 | 12.56 |
| 42 | linear_week_prev | 86800.869250 | 45113.785953 | 0.911358 | 87029.587287 | 2.28 | 41.45 | 8.31 | 9.06 | 15.42 |
| 43 | xgb_week_prev | 46000.865875 | 27596.506250 | 0.981093 | 64512.623049 | 1.69 | 30.72 | 6.16 | 6.71 | 11.43 |
| 44 | tree_week_prev_diff | 65204.758819 | 40904.554268 | 0.965813 | 85196.994884 | 2.23 | 40.57 | 8.14 | 8.87 | 15.10 |
| 45 | random_forest_week_prev_diff | 50794.709573 | 30599.649300 | 0.979589 | 67673.669578 | 1.77 | 32.23 | 6.46 | 7.04 | 11.99 |
| 46 | linear_week_prev_diff | 86962.744052 | 47034.706178 | 0.904665 | 88700.551955 | 2.32 | 42.24 | 8.47 | 9.23 | 15.72 |
| 47 | xgb_week_prev_diff | 44516.634006 | 27577.322500 | 0.984334 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 48 | tree_week_prev_diff_no_holiday | 64271.138099 | 38094.548929 | 0.967965 | 87770.463846 | 2.30 | 41.80 | 8.38 | 9.14 | 15.55 |
| 49 | random_forest_week_prev_diff_no_holiday | 51963.058157 | 32239.973800 | 0.978521 | 71091.425702 | 1.86 | 33.86 | 6.79 | 7.40 | 12.60 |
| 50 | linear_week_prev_diff_no_holiday | 89256.832175 | 45856.204746 | 0.912547 | 86273.638834 | 2.26 | 41.09 | 8.24 | 8.98 | 15.29 |
| 51 | xgb_week_prev_diff_no_holiday | 45099.909304 | 28315.090000 | 0.984759 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
| 52 | tree_week_prev_diff_store | 58768.637712 | 36254.872752 | 0.973139 | 88125.485187 | 2.31 | 41.97 | 8.42 | 9.17 | 15.61 |
| 53 | random_forest_week_prev_diff_store | 51934.530629 | 32126.186900 | 0.979371 | 71808.189709 | 1.88 | 34.20 | 6.86 | 7.47 | 12.72 |
| 54 | linear_week_prev_diff_store | 87151.829883 | 51733.745385 | 0.921927 | 90781.413771 | 2.38 | 43.23 | 8.67 | 9.45 | 16.09 |
| 55 | xgb_week_prev_diff_store | 41085.150129 | 25151.431250 | 0.985960 | 67456.066304 | 1.77 | 32.12 | 6.44 | 7.02 | 11.95 |
results_df.to_csv('../walmart_sales_results.csv')
results_df.sort_values(by=['mae'], ascending=True).head()
| | model | mae | median | r2 | score | score_by_max | score_by_min | score_by_mean | score_by_median | score_by_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 55 | xgb_week_prev_diff_store | 41085.150129 | 25151.43125 | 0.985960 | 67456.066304 | 1.77 | 32.12 | 6.44 | 7.02 | 11.95 |
| 15 | xgb_week_prev | 42243.329593 | 27601.76500 | 0.986700 | 64512.623049 | 1.69 | 30.72 | 6.16 | 6.71 | 11.43 |
| 23 | xgb_week_prev_diff_no_holiday | 44215.133311 | 27247.98250 | 0.984934 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
| 47 | xgb_week_prev_diff | 44516.634006 | 27577.32250 | 0.984334 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 51 | xgb_week_prev_diff_no_holiday | 45099.909304 | 28315.09000 | 0.984759 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
results_df.sort_values(by=['score'], ascending=True).head()
| | model | mae | median | r2 | score | score_by_max | score_by_min | score_by_mean | score_by_median | score_by_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 47 | xgb_week_prev_diff | 44516.634006 | 27577.32250 | 0.984334 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 19 | xgb_week_prev_diff | 45318.707146 | 28028.25500 | 0.984884 | 62394.921873 | 1.63 | 29.71 | 5.96 | 6.49 | 11.06 |
| 51 | xgb_week_prev_diff_no_holiday | 45099.909304 | 28315.09000 | 0.984759 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
| 23 | xgb_week_prev_diff_no_holiday | 44215.133311 | 27247.98250 | 0.984934 | 64371.132537 | 1.69 | 30.65 | 6.15 | 6.70 | 11.41 |
| 43 | xgb_week_prev | 46000.865875 | 27596.50625 | 0.981093 | 64512.623049 | 1.69 | 30.72 | 6.16 | 6.71 | 11.43 |
A lowest mean absolute error of around 41 thousand dollars is very impressive indeed. When looking at the averaged cross-validation score, however, it rises to about 62 thousand, which is still only around 6% of the mean weekly sales and a measly 1.6% of the maximum recorded sales. That is especially impressive given the premise of this project: none of the factors relate directly to store sales, with the closest in theory being the Consumer Price Index.
best_model_results = results_df[results_df['model'] == 'xgb_week_prev_diff']
Exploring the predictions / results further¶
- Are there any stores the model performs better or worse on?
- Are good or bad predictions related to some of the factors?
# using the dataset with store information for additional grouping for analysis, but training without it
df_week_prev_diff_store
| | store | weekly_sales | holiday_flag | temperature | fuel_price | cpi | unemployment | week_number | month | week_of_month | ... | prev_unemployment | temperature_diff | fuel_price_diff | cpi_diff | unemployment_diff | prev_holiday_flag | holiday_flag_diff | prev_month | prev_week_number | prev_week_of_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1643690.90 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 | 5 | 2 | 1 | ... | 8.106 | 0.00 | 0.000 | 0.000000 | 0.000 | 0.0 | 0.0 | 2.0 | 5 | 1.0 |
| 1 | 1 | 1641957.44 | 1 | 38.51 | 2.548 | 211.242170 | 8.106 | 6 | 2 | 2 | ... | 8.106 | -3.80 | -0.024 | 0.145812 | 0.000 | 0.0 | 1.0 | 2.0 | 5 | 1.0 |
| 2 | 1 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 | 7 | 2 | 3 | ... | 8.106 | 1.42 | -0.034 | 0.046973 | 0.000 | 1.0 | -1.0 | 2.0 | 6 | 2.0 |
| 3 | 1 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 | 8 | 2 | 4 | ... | 8.106 | 6.70 | 0.047 | 0.030500 | 0.000 | 0.0 | 0.0 | 2.0 | 7 | 3.0 |
| 4 | 1 | 1554806.68 | 0 | 46.50 | 2.625 | 211.350143 | 8.106 | 9 | 3 | 1 | ... | 8.106 | -0.13 | 0.064 | 0.030500 | 0.000 | 0.0 | 0.0 | 2.0 | 8 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6430 | 45 | 713173.95 | 0 | 64.88 | 3.997 | 192.013558 | 8.684 | 39 | 9 | 5 | ... | 8.684 | -0.44 | -0.041 | 0.156854 | 0.000 | 0.0 | 0.0 | 9.0 | 38 | 4.0 |
| 6431 | 45 | 733455.07 | 0 | 64.89 | 3.985 | 192.170412 | 8.667 | 40 | 10 | 1 | ... | 8.684 | 0.01 | -0.012 | 0.156854 | -0.017 | 0.0 | 0.0 | 9.0 | 39 | 5.0 |
| 6432 | 45 | 734464.36 | 0 | 54.47 | 4.000 | 192.327265 | 8.667 | 41 | 10 | 2 | ... | 8.667 | -10.42 | 0.015 | 0.156854 | 0.000 | 0.0 | 0.0 | 10.0 | 40 | 1.0 |
| 6433 | 45 | 718125.53 | 0 | 56.47 | 3.969 | 192.330854 | 8.667 | 42 | 10 | 3 | ... | 8.667 | 2.00 | -0.031 | 0.003589 | 0.000 | 0.0 | 0.0 | 10.0 | 41 | 2.0 |
| 6434 | 45 | 760281.43 | 0 | 58.85 | 3.882 | 192.308899 | 8.667 | 43 | 10 | 4 | ... | 8.667 | 2.38 | -0.087 | -0.021955 | 0.000 | 0.0 | 0.0 | 10.0 | 42 | 3.0 |
6435 rows × 24 columns
best_model
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', StandardScaler(),
['store', 'holiday_flag',
'temperature', 'fuel_price',
'cpi', 'unemployment',
'week_number', 'month',
'week_of_month',
'prev_weekly_sales',
'prev_temperature',
'prev_fuel_price',
'prev_cpi',
'prev_unemployment',
'temperature_diff',
'fuel_price_diff',
'cpi_diff',
'unemployment_di...
feature_types=None, gamma=None, grow_policy=None,
importance_type=None,
interaction_constraints=None, learning_rate=0.01,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=2000, n_jobs=None,
num_parallel_tree=None, random_state=None, ...))])
import pickle
best_model_pickle = pickle.dumps(best_model)
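For completeness, the pickled bytes round-trip back into a working estimator via pickle.loads. A minimal sketch with a stand-in model (not the actual best_model pipeline):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in estimator; in the notebook this would be `best_model`
model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]),
                               np.array([0.0, 2.0, 4.0]))

payload = pickle.dumps(model)      # bytes, ready to write to disk
restored = pickle.loads(payload)   # later, possibly in another process

pred = restored.predict(np.array([[3.0]]))[0]  # fitted y = 2x, so ~6.0
```

The restored estimator predicts identically to the original, which is what makes pickling suitable for shipping the trained pipeline.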
Testing the best performing model on a random sample¶
test_sample = df_week_prev_diff_store.sample(frac=0.2)
X = test_sample.drop(['weekly_sales'], axis=1)
y = test_sample[['weekly_sales']]
# X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.8)
test_sample['predicted_sales'] = best_model.predict(X)
test_sample['error'] = abs(test_sample['weekly_sales'] - test_sample['predicted_sales'])
test_sample.columns
Index(['store', 'weekly_sales', 'holiday_flag', 'temperature', 'fuel_price',
'cpi', 'unemployment', 'week_number', 'month', 'week_of_month',
'prev_weekly_sales', 'prev_temperature', 'prev_fuel_price', 'prev_cpi',
'prev_unemployment', 'temperature_diff', 'fuel_price_diff', 'cpi_diff',
'unemployment_diff', 'prev_holiday_flag', 'holiday_flag_diff',
'prev_month', 'prev_week_number', 'prev_week_of_month',
'predicted_sales', 'error'],
dtype='object')
Grouping the sample results by store¶
Can we learn which features contributed the most to the error? Or what kind of stores (by regional metrics) our model excels at predicting for, and where it struggles?
test_sample_by_store = test_sample.groupby(['store'], as_index=False).agg(
    store=('store', 'first'),
    average_error=('error', 'mean'),
    average_sales=('weekly_sales', 'mean'),
    average_prediction=('predicted_sales', 'mean'),
    average_temp=('temperature', 'mean'),
    average_fuel_price=('fuel_price', 'mean'),
    average_cpi=('cpi', 'mean'),  # was ('fuel_price', 'mean'), which duplicated the fuel price column in the outputs below
    average_unemployment=('unemployment', 'mean')
)
test_sample['error'].sort_values()
0 34.2750
3867 55.1950
4934 60.4450
3453 95.8700
3877 97.8750
...
3723 240147.1200
2319 305923.5925
2513 319462.4975
2514 386265.0475
1860 698753.6600
Name: error, Length: 1287, dtype: float64
best_stores = test_sample_by_store.sort_values(by=['average_error'], ascending=True).head()
worst_stores = test_sample_by_store.sort_values(by=['average_error'], ascending=True).tail()
Average error by store¶
px.bar(test_sample_by_store, x='store', y='average_error')
Best and worst stores comparison¶
best_worst_fig = make_subplots(
    x_title='store',
    y_title='average error'
)
best_worst_fig.add_trace(
    go.Bar(
        x=best_stores['store'],
        y=best_stores['average_error'],
        name='best stores',
        marker=dict(color='rgb(34,163,192)')
    )
)
best_worst_fig.add_trace(
    go.Bar(
        x=worst_stores['store'],
        y=worst_stores['average_error'],
        name='worst stores',
    ),
)
best_worst_fig.show()
best_stores
| | store | average_error | average_sales | average_prediction | average_temp | average_fuel_price | average_cpi | average_unemployment |
|---|---|---|---|---|---|---|---|---|
| 4 | 5 | 10106.374474 | 315419.382105 | 321278.21875 | 67.115789 | 3.224211 | 3.224211 | 6.170842 |
| 29 | 30 | 10706.563831 | 440733.381290 | 438260.87500 | 67.753548 | 3.264742 | 3.264742 | 7.560548 |
| 37 | 38 | 11322.460286 | 387272.607143 | 389599.21875 | 66.412286 | 3.529657 | 3.529657 | 13.143543 |
| 35 | 36 | 11569.062500 | 368190.842857 | 364764.84375 | 72.787143 | 3.260886 | 3.260886 | 7.820429 |
| 43 | 44 | 12021.571216 | 306873.114865 | 311331.03125 | 56.275405 | 3.400000 | 3.400000 | 6.492162 |
What characterizes the best and worst performing stores?¶
That is: what are their characteristic values on average, and in which areas do the best and worst stores differ?
best_stores_avg = best_stores.mean()
best_stores_avg
store                     30.600000
average_error          11145.206461
average_sales         363697.865652
average_prediction    365046.843750
average_temp              66.068834
average_fuel_price         3.335899
average_cpi                3.335899
average_unemployment       8.237505
dtype: float64
worst_stores_avg = worst_stores.mean()
worst_stores_avg
store                   1.560000e+01
average_error           5.169260e+04
average_sales           1.670876e+06
average_prediction      1.672463e+06
average_temp            5.952750e+01
average_fuel_price      3.394260e+00
average_cpi             3.394260e+00
average_unemployment    7.851085e+00
dtype: float64
best_worst_avg_diff = best_stores_avg - worst_stores_avg
best_worst_avg_diff
store                   1.500000e+01
average_error          -4.054740e+04
average_sales          -1.307178e+06
average_prediction     -1.307416e+06
average_temp            6.541331e+00
average_fuel_price     -5.836059e-02
average_cpi            -5.836059e-02
average_unemployment    3.864199e-01
dtype: float64
px.bar(best_worst_avg_diff.drop(['store']))
px.bar(best_worst_avg_diff.drop(['store', 'average_error', 'average_sales', 'average_prediction']))
It seems our model performs better on stores from generally warmer regions. This could just be an artifact of the sample chosen, and we don't want to predict on the whole dataset again, to avoid predicting on training data. Still, a difference of around 6 degrees seems hard to ignore.
px.scatter(test_sample, x='temperature', y='error')
Reasonings / Decisions made¶
Why cross-validate¶
Clearly, there is some variance in the model's performance depending on the data sample: a model might perform exceptionally well on some portion of the dataset while, validated across the whole dataset, performing worse on average, sometimes significantly so. It's worth being aware of the worst-case scenarios and the overall range of performance for any model in order to accurately judge its usefulness, especially with a dataset like the one used here, where both the predicted variable and the factors may vary significantly between stores and regions.
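This fold-to-fold spread is exactly what scikit-learn's cross_val_score exposes; a small illustrative sketch on synthetic data (not the project's exact setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the sales data
X, y = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=0)

# scoring returns *negative* MAE, so flip the sign back
fold_maes = -cross_val_score(
    DecisionTreeRegressor(max_leaf_nodes=200, random_state=0),
    X, y, cv=5, scoring='neg_mean_absolute_error',
)

spread = fold_maes.max() - fold_maes.min()  # fold-to-fold variability
```

Reporting the mean together with the spread (or the worst fold) gives a much more honest picture than a single train/test split.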
Standard scaling¶
One important issue models may run into while training, and which the user needs to be aware of, is the scale of the values of the factors used. For example, in a region where many customers need to drive to reach stores, a slight change in fuel price might influence their decisions more than a change in the Consumer Price Index. But whereas a 100% increase in fuel price would in some regions amount to roughly 2-3 dollars, a 100% CPI change might correspond to a raw numerical increase of around 100: even though the relative percentage change is the same, the factor moved about 50 times more in absolute terms. With that in mind, it can prove useful to scale all of the factors to a set range of values, say 0-1, so that the relationships between values within each factor are kept, but no factor overshadows the others.
In the problem at hand, where the factors come from quite different domains, standard scaling may be especially useful precisely because the factors share no common scale. If they came from similar domains, say the average price of fruit and the average price of appliances for that week, then even if their scales differed, the relation between their raw values might itself be of interest to us and the model.
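Mapping every factor onto a 0-1 range, as described above, is what scikit-learn's MinMaxScaler does (StandardScaler, used in the training function, instead centers each factor to zero mean and unit variance). A minimal sketch with two illustrative factors:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two factors on very different raw scales:
# fuel price (USD/gallon) and CPI
raw = np.array([[2.5, 130.0],
                [3.0, 175.0],
                [4.0, 220.0]])

scaled = MinMaxScaler().fit_transform(raw)  # each column mapped onto [0, 1]
```

After scaling, both columns span exactly [0, 1], so the within-factor ordering is preserved while neither factor dominates by sheer magnitude.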